Morphology Based Automatic Acquisition of Large-coverage Lexica

Authors

  • Lionel Clément
  • Benoît Sagot
  • Bernard Lang
Abstract

In this article, we introduce a new technique for constructing wide-coverage morphological lexica from large corpora and morphological knowledge, with an application to French. It relies on the idea that the existence of a hypothetical lemma can be guessed if several different words found in the corpus are best interpreted as morphological variants of this lemma. We first validated our technique by extracting verbs and adjectives from a general French corpus of 25 million words. Compared with other lexical resources available for French, our results are very satisfying, since we cover many words, often derived words, that other lexica do not always include. Applying our algorithm to the acquisition of domain-specific adjectives from a botanical corpus also gave very good results, demonstrating that it can be used to extract domain-specific lexica. Moreover, it is generalizable to any language with a substantial morphology. Part of the resulting lexicon (currently verbal forms) is already freely available at http://www.lefff.net/.
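
The core idea in the abstract, that a lemma is plausible when several distinct corpus words can be read as its inflected forms, can be illustrated with a minimal sketch. This is not the authors' algorithm: the two toy paradigms, the `guess_lemmas` function, and the `min_support` threshold are hypothetical simplifications introduced here, and a plain count of supporting forms stands in for whatever ranking the paper uses to decide which interpretation is "best".

```python
from collections import defaultdict

# Hypothetical toy paradigms (assumption, not from the paper):
# name -> (ending of the lemma, possible inflectional endings).
PARADIGMS = {
    "verb-er": ("er", ["er", "e", "es", "ons", "ez", "ent", "é"]),
    "adj": ("", ["", "e", "s", "es"]),
}


def guess_lemmas(forms, min_support=3):
    """Map (lemma, paradigm) hypotheses to their supporting corpus forms.

    A hypothesis is kept only if at least `min_support` distinct forms
    can be analysed as inflections of the hypothetical lemma.
    """
    support = defaultdict(set)
    for form in forms:
        for name, (lemma_end, endings) in PARADIGMS.items():
            for ending in endings:
                if not form.endswith(ending):
                    continue
                # Strip the ending; the slice also handles the empty ending.
                stem = form[: len(form) - len(ending)]
                if len(stem) < 3:
                    continue  # skip implausibly short stems
                support[(stem + lemma_end, name)].add(form)
    return {
        hyp: sorted(found)
        for hyp, found in support.items()
        if len(found) >= min_support
    }


if __name__ == "__main__":
    corpus_forms = {"chanter", "chante", "chantons", "chantez",
                    "vert", "verte", "verts"}
    for (lemma, paradigm), found in guess_lemmas(corpus_forms).items():
        print(lemma, paradigm, found)
```

On this toy input, spurious hypotheses such as an adjective "chant" (supported only by "chante") fall below the threshold, while "chanter" and "vert" are each supported by several attested forms. A realistic system would need full paradigms and a principled way to compare competing analyses.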

Similar resources

Specifications of Building Polish Lexica for Application in ASR and TTS Systems

This paper brings detailed information concerning the specifications of building Polish lexica of common and special application words for use in speech applications such as ASR (automatic speech recognition) or TTS (text-to-speech) synthesis. The specifications include information on the collection of text corpora and word lists, phonetic, grammatical and morphological annotation, as well as s...

Evaluating and improving syntactic lexica by plugging them within a parser

We present some evaluation results for four French syntactic lexica, obtained through their conversion to the Alexina format used by the Lefff lexicon (Sagot, 2010), and their integration within the large-coverage TAG-based FRMG parser (de La Clergerie, 2005). The evaluations are run on two test corpora, annotated with two distinct annotation formats, namely EASy/Passage chunks and relations an...

ATOLL - A framework for the automatic induction of ontology lexica

There is a range of large knowledge bases, such as Freebase and DBpedia, as well as linked data sets available on the web, but they typically lack lexical information stating how the properties and classes they comprise are realized lexically. Often only one label is attached, if at all, thus lacking rich linguistic information, e.g. about morphological forms, syntactic arguments or possible le...

Spanish Lexical Acquisition via Morpho-Semantic Constructive Derivational Morphology

This paper describes an algorithm for Spanish derivational morphology whose output is generalizable to two different lexicon acquisition situations. One is the process of automatic lexicon acquisition via the use of Morpho-Semantic Lexical Rules (MSLRs) (Viegas, Gonzalez, & Longwell 1996), usable in semantically based Natural Language Processing (Nirenburg, et al 1996) in order to considerably r...

The Role of Morphology in Generating High-Quality Pronunciation Lexica for Regional Variants of Portuguese

Grapheme to phoneme (GTP) systems for languages such as English, German, and Korean have been shown to achieve better performance rates with the inclusion of a morpho-phonological preprocessing component. While semiautomatic and automatic GTP approaches for Portuguese continue to achieve steady gains, such algorithms do not take morphology into account, despite a growing need to do so, based in...

Publication date: 2004